Programmable  Numerical  Function  Generators  for  Two-Variable  Functions 


Shinobu  Nagayama 


Jon  T.  Butler 


Tsutomu  Sasao 


Dept,  of  Computer  and 
Network  Engineering, 
Hiroshima  City  University, 
Hiroshima  731-3194,  Japan 


Dept,  of  Electrical  and 
Computer  Engineering, 
Naval  Postgraduate  School, 
CA  93943-5 121,  USA 


Dept,  of  Computer  Science 
and  Electronics, 

Kyushu  Institute  of  Technology, 
Iizuka  820-8502,  Japan 


Abstract 

This  paper  proposes  a  design  method  and  programmable 
architectures  for  numerical  function  generators  (NFGs)  of 
two-variable  functions.  To  realize  a  two-variable  function 
in  hardware,  we  partition  a  given  domain  of  the  given  func¬ 
tion  into  segments,  and  approximate  the  function  by  a  poly¬ 
nomial  in  each  segment.  This  paper  introduces  two  planar 
segmentation  algorithms  that  efficiently  partition  a  domain 
of  a  two-variable  function.  This  paper  also  introduces  two 
architectures  that  can  realize  a  wide  range  of  two-variable 
functions.  Our  architectures  allow  a  systematic  design  of 
two-variable  functions.  FPGA  implementation  results  show 
that,  for  a  complicated  function,  our  NFG  achieves  58%  of 
memory  size  and  39%  of  delay  time  of  a  circuit  designed 
using  one-variable  NFGs. 


1.  Introduction 

The  ability  to  compute  numerical  functions  at  a  high 
speed  is  important  in  many  applications,  including  3D  com¬ 
puter  graphics  and  digital  signal  processing  [11].  How¬ 
ever,  most  existing  methods  are  intended  only  for  one- 
variable  functions  [4,  9,  13-16],  and  only  a  few  methods 
exist  for  multi-variable  functions  [5,6, 17],  Since  these  pa¬ 
pers  [5, 6, 17]  present  hardwares  dedicated  to  specific  func¬ 
tions,  different  functions  need  different  design  methods.  As 
far  as  we  know,  systematic  design  method  for  generic  multi- 
variable  functions  has  never  been  presented. 

A  straightforward  design  method  for  arbitrary  multi- 
variable  function  is  to  use  a  single  memory  in  which  the 
address  is  a  combination  of  values  of  variables  and  the  con¬ 
tent  of  that  address  is  the  corresponding  value  of  function. 
This  method  is  fast,  but  requires  a  2m"-word  memory  to  im¬ 
plement  an  w-variable  function  with  n  bits  for  each  variable. 
Even  for  small  m  and  n,  this  method  is  impractical  because 
of  large  memory  size. 

To  produce  a  practical  implementation,  multi-variable 
functions  are  often  designed  using  combination  of  one- 


variable  function  generators,  multipliers,  and  adders  [5,6]. 
This  design  method  reduces  the  required  memory  size. 
However,  depending  on  the  function  implemented,  it  can 
produce  a  slow  implementation  because  of  its  complicated 
hardware  architecture.  Also,  complicated  hardware  archi¬ 
tecture  makes  error  analysis  harder.  That  is,  guaranteeing 
output  accuracy  becomes  harder. 

This  paper  proposes  a  systematic  design  method  for  two- 
variable  functions.  Since  our  design  method  is  based  on 
a  piecewise  polynomial  approximation,  hardware  architec¬ 
tures  are  simple  even  for  complicated  functions.  To  ap¬ 
proximate  a  given  function  using  piecewise  polynomials, 
this  paper  introduces  two  planar  segmentation  algorithms 
that  partition  a  given  domain  of  two-variable  function  effi¬ 
ciently.  This  paper  also  introduces  two  programmable  ar¬ 
chitectures  that  can  realize  a  wide  range  of  two-variable 
functions. 

The  rest  of  this  paper  is  organized  as  follows:  Section  2 
introduces  a  number  representation  and  the  decision  dia¬ 
grams  used  in  this  paper.  Section  3  presents  two  planar  seg¬ 
mentation  algorithms.  Section  4  presents  two  architectures 
for  two-variable  functions.  Section  5  evaluates  performance 
of  our  segmentation  algorithms  and  architectures  for  spe¬ 
cific  two-variable  functions.  And,  Section  6  concludes  the 
paper.  Error  analysis  for  our  NFGs  is  omitted  because  it  is 
the  almost  same  as  [12, 15]. 

2.  Preliminaries 

2.1.  Number  Representation  and  Errors 

Definition  1  A  value  X  represented  by  the  binary  fixed- 
point  representation  is  denoted  by 

X  =  (xj—  1  Xl-2  ...  x\  -Vo-  -*-l  X-2  •  •  •  X-m), 

where  x,-  G  {0, 1},  l  is  the  number  of  bits  in  the  integer  part, 
and  m  is  the  number  of  bits  in  the  fractional  part.  Each  bit 
xt  contributes  2 'xt  to  the  value  ofX  except,  x/_ \,  which  con¬ 
tributes  —2l~lxi-i.  That  is,  the  fixed-point  representation 
is  in  two’s  complement. 
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Definition  2  Error  is  the  absolute  difference  between  the 
exact  value  and  the  value  produced  by  the  hardware.  Ac¬ 
ceptable  error  is  the  maximum  error  that  an  NFG  may  as¬ 
sume;  it  is  usually  a  specification  to  be  satisfied  by  the  hard¬ 
ware.  Approximation  error  is  the  error  caused  by  a  func¬ 
tion  approximation.  Acceptable  approximation  error  is  the 
maximum  approximation  error  that  a  function  approxima¬ 
tion  may  assume.  Rounding  error  is  the  error  caused  by  a 
binary  fixed-point  representation. 

Definition  3  Accuracy  is  the  number  of  bits  in  the  frac¬ 
tional  part  of  a  binary  fixed-point  representation,  m-bit  ac¬ 
curacy  specifies  that  m  bits  are  used  to  represent  the  frac¬ 
tional  part  of  the  number.  When  the  maximum  error  is  2~m, 
the  accuracy  is  no  greater  than  1  unit  in  the  last  place 
(ULP)  [11].  In  this  paper,  an  m-bit  accuracy  NFG  is  an 
NFG  with  an  m-bit  fractional  part  of  the  inputs,  an  m-bit 
fractional  part  of  the  output,  and  a  1  ULP  error. 

2.2.  Decision  Diagrams 

Definition  4  A  binary  decision  diagram  (BDD)  [2, 10]  is 
a  rooted  directed  acyclic  graph  (DAG)  representing  a  logic 
function.  The  BDD  is  obtained  by  recursively  applying  the 
Shannon  expansion  f  =  xifo  +  x,f\  to  the  logic  function.  It 
consists  of  two  terminal  nodes  representing  function  values 
0  and  1  respectively,  and  non-terminal  nodes  labeled  by  in¬ 
put  variables.  Each  non-terminal  node  has  two  unweighted 
outgoing  edges,  0-edge  and  1  -edge,  that  correspond  to  the 
values  of  the  input  variable.  The  terminal  nodes  have  no 
outgoing  edges.  We  consider  only  ordered  BDDs,  where  the 
order  of  the  variables  is  the  same  for  every  path  from  the 
root  node  to  a  terminal  node.  We  consider  only  reduced 
BDDs,  where  identical  subtrees  are  combined  into  a  single 
tree. 

Definitions  A  multi-terminal  BDD  (MTBDD)  [3]  is  an 

extension  of  a  BDD,  that  represents  an  integer-valued  func¬ 
tion:  {0, 1}"  — >  Z,  where  Z  is  a  finite  set  of  integers.  In  the 
MTBDD,  the  terminal  nodes  are  labeled  by  the  values  ofZ. 

Definition  6  An  edge-valued  BDD  (EVBDD)  [7,8]  is  also 
an  extension  of  a  BDD,  that  represents  an  integer-valued 
function.  The  EVBDD  is  obtained  by  repeatedly  applying 
the  expansion  f  =  xifo  +  x, (/[  +  ot)  to  the  integer-valued 
function,  where  f\  =  f[  +  a,  and  a  is  the  constant  term  of  f\ . 
In  the  EVBDD,  each  1-eclge  has  an  integer  weight  a  and  all 
0 -edges  have  weight  0.  There  is  only  one  terminal  node;  it 
is  labeled  0.  The  incoming  edge  into  the  root  node  can  have 
a  non-zero  weight.  For  example,  a  non-zero  weight  a  on  the 
incoming  edge  of  the  root  node  adds  a  to  all  sums  associ¬ 
ated  with  the  EVBDD.  Indeed,  it  occurs  when  the  EVBDD 
is  a  sub-EVBDD  to  a  larger  EVBDD. 

Example  1  Fig.  1(b)  and  (c)  show  an  MTBDD  and  an 
EVBDD  for  the  integer-valued  function  f  defined  by 
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(a)  Function  table. 


Figure  1.  MTBDD  and  EVBDD  for  an  integer¬ 
valued  function. 


Fig.  1(a).  In  Fig.  1(b)  and  (c),  dashed  lines  and  solid  lines 
denote  0-edges  and  1-edges,  respectively.  Note  that  the 
EVBDD  has  weighted  1-edges.  In  the  MTBDD,  terminal 
nodes  represent  function  values.  Thus,  to  evaluate  the  func¬ 
tion,  we  traverse  the  MTBDD  from  the  root  node  to  a  ter¬ 
minal  node  according  to  the  input  values,  and  obtain  the 
function  value  (an  integer)  from  the  terminal  node.  On  the 
other  hand,  in  the  EVBDD,  we  obtain  the  function  value  by 
summing  the  weights  of  the  edges  traversed  from  the  root 
node  to  the  terminal  node.  ( End  of  Example) 

3.  Piecewise  Polynomial  Approximation  Based 
on  Planar  Segmentation 

3.1.  Planar  Segmentation  Problem 

To  approximate  a  given  two-variable  function  by  piece- 
wise  polynomials,  we  need  to  partition  a  given  domain  of 
the  function  into  segments.  The  domain  of  a  two-variable 
function  consists  of  planar  segments,  and  requires  a  planar 
segmentation  algorithm.  The  memory  size  and  speed  of  an 
NFG  are  strongly  dependent  on  the  efficiency  of  the  seg¬ 
mentation  algorithm.  Thus,  effective  planar  segmentation 
algorithms  are  important  to  design  fast  and  compact  NFGs. 
To  produce  an  optimum  segmentation,  we  consider  the  fol- 


Input:  Numerical  function  f(X,Y),  domain  {[Xb,Xe),  [Yb .  Ye ) }  for  X  and  Y,  accuracy  m;>,  of  X  and  Y,  polynomial  order  d ,  and  accept¬ 
able  approximation  error  £a. 

*Requirement:  X  and  Y  are  represented  in  the  same  number  of  bits. 

Output:  Segments  {[Xh,P0),[Yh,Q0)},{[Xh,P0),  [go,  Q\ )}...,  {[P,-  i,Xe),  [Qr-\,Ye]},  and  correction  values  vp,  vi,  ■  ■  ■ ,  v^-i  ■ _ 

Step: 

1 .  For  { \Xb , Xe ) ,  [Yb ,  Ye ) } ,  compute  an  approximate  polynomial  gl/(X,Y). 

2.  Compute  the  maximum  positive  error  max/g  =  max{/(X,  Y)  —  gj (X, T)}. 

3.  Compute  the  maximum  negative  error  mirifg  =  min{/(X,F)  —  gj(X,Y)}. 

4.  Compute  approximation  error  £j  =  (maxfg  —mirifg)/ 2  and  correction  values  v  =  ( maxfg  +  mirifg)/2. 

5.  If  £j  <  £a  or  (Xe  —X/,)  <  2 then  stop. 

6.  Else,  partition  {[Xb,Xe),  [Yb,Ye)}  into  four  segments  {[Xb,P),  [Yb,Q)},  {{Xb,P),  [ Q,Ye )},  {[PXe),  [Yb,Q)},  and  {[P,Xe),  [Q,Ye)}, 
where  P  =  ( Xb  +Xe) /2  and  Q  =  (Yb  +  Ye)/ 2. 

7.  Repeat  Steps  1,  2, ...  ,6  for  each  new  segment  recursively,  until  the  maximum  approximation  errors  are  smaller  than  £a  in  all 
segments. 


Figure  2.  Recursive  planar  segmentation  algorithm. 


lowing: 

1 .  number  of  words  in  the  coefficients  memory,  which  is 
the  number  of  segments,  and 

2.  complexity  of  the  segment  index  encoder,  which  maps 
values  of  X  and  Y  to  a  segment  number. 

Fewer  segments  are  preferred  because  the  number  of  seg¬ 
ments  directly  affects  memory  size  of  the  NFG.  The  com¬ 
plexity  of  the  segment  index  encoder  is  also  important. 
Even  if  the  number  of  segments  is  minimum,  a  large  NFG 
is  produced  if  the  segment  index  encoder  is  very  large.  Es¬ 
pecially,  planar  segmentations  tend  to  require  significantly 
more  complex  segment  index  encoders  than  linear  segmen¬ 
tations.  Thus,  planar  segmentation  algorithms  considering 
these  two  parameters  are  essential  to  the  design  of  fast  and 
compact  NFGs.  Also,  the  complexity  of  segmentation  algo¬ 
rithms  should  be  considered  in  order  to  reduce  design  time. 

The  next  subsection  presents  two  heuristic  planar  seg¬ 
mentation  algorithms. 

3.2.  Planar  Segmentation  Algorithms 

We  first  present  a  recursive  planar  segmentation  algo¬ 
rithm  to  reduce  the  hardware  complexity  of  both  the  coef¬ 
ficients  memory  (the  number  of  segments)  and  the  segment 
index  encoder. 

We  provide  a  geometric  explanation  for  piecewise  planar 
(lst-order)  polynomial  approximation.  Approximating  a 
two-variable  function  is  accomplished  with  parallelograms 
that  project  onto  squares  on  the  X-Y  plane.  First,  a  (large) 
single  parallelogram  is  used  to  approximate  the  entire  given 
function.  It  projects  onto  the  X-Y  plane  as  a  square  with 
corners  at  ( XblYb ),  ( Xb,Ye ),  ( Xe,Yb ),  and  ( Xe,Ye ),  where 
Xe  —Xb  =  Ye  —  Yb.  The  parallelogram’s  orientation  and  al¬ 
titude  are  chosen  to  minimize  the  maximum  error.  If  this 
maximum  error  exceeds  the  given  acceptable  error,  the  fol¬ 
lowing  process  is  repeated.  The  projected  square  is  divided 


into  four  squares  each  one  fourth  the  area  of  the  original 
square.  This  square  is  said  to  be  quadsected.  In  each  of  the 
four  sections,  a  parallelogram  is  determined  that  approxi¬ 
mates  the  function  with  the  smallest  maximum  error.  If  that 
error  exceeds  the  given  acceptable  error,  that  square  is  quad¬ 
sected,  and  the  process  repeated.  The  process  stops  when  all 
square  areas  are  approximated  by  a  parallelogram  to  within 
the  given  acceptable  error.  It  follows  that,  in  areas  where 
the  function  varies  rapidly,  small  squares  are  used,  and,  in 
areas  where  the  function  is  nearly  planar,  large  squares  are 
used. 

Fig.  2  shows  this  algorithm.  Note  that  this  algorithm 
can  apply  to  polynomial  approximation  with  any  degree. 
The  inputs  are  a  numerical  function  f(X,Y),  a  domain 
{[Xb,Xe),  [Yb,Ye)}  for  X  and  Y ,  an  accuracy  mln  of  X  and 
Y ,  a  polynomial  order  d,  and  an  acceptable  approximation 
error  £„.  Then,  this  algorithm  produces  segments  by  recur¬ 
sively  partitioning  a  segment  into  four  equal- sized  square 
segments  until  achieving  the  acceptable  approximation  er¬ 
ror  £a  in  all  segments.  Note  that  this  algorithm  creates  a 
segment  of  size  viy  x  vty,  where  Wj  =  2hi  x  2~m’n  and  hi  is  an 
integer.  That  is,  all  the  segmentation  points  /(  and  <2,  are  re¬ 
stricted  to  values  of  which  the  least  significant  /?,-  bits  are  0 
(i.e.,  Pi  ={...  p-j+i  p-j  00  ...  0),  where  j  =  min  -  hi).  In 
Fig.  2,  the  approximating  polynomial  gf(X,Y)  is  obtained 
by  a  Taylor  expansion  of  f(X,Y)  at  the  center  ( u,v )  of  the 
segment: 


gd(X,Y)  =  f(u,v)  + 


f{u,v ) 


+ 


Z(»a) 

2! 


+  ...+ 


Z(»t) 

d\ 


where  s  =  X  —  u,t  =  Y  —  v,u=  (Bx  +  Ex) /2,  and  v  =  ( By  + 
Ey)/2.  To  reduce  the  approximation  error,  the  maximum 
positive  error  maxfg  and  the  maximum  negative  error  minfg 
are  equalized  by  a  vertical  shift  of  g(i(X,Y)  with  correc¬ 
tion  value  v.  Thus,  the  approximation  error  is  ( maxfg  — 


X  Y  X  Y 


P(X,Y)  P(X,Y) 

(a)  Architecture  for  recursive  (b)  Architecture  for  uniform 
segmentation.  segmentation. 

Figure  3.  Two  architectures  for  two-variable 
NFGs  based  on  planar  approximation. 


minjg)  /2,  and  the  approximating  polynomial  used  for  the 
NFG  is  gd(X,Y)  +  v. 

Next,  we  present  the  planar  uniform  segmentation  algo¬ 
rithm.  Since  the  recursive  planar  segmentation  algorithm 
produces  non-uniform  segmentation,  the  segment  index  en¬ 
coder  is  needed  to  compute  a  segment  number  from  values 
of  X  and  Y.  However,  in  a  uniform  segmentation  where 
the  number  of  segments  is  a  power  of  2,  the  segment  in¬ 
dex  encoder  is  not  necessary  because  a  segment  number 
is  obtained  by  the  most  significant  bits  of  X  and  Y  (see 
Fig.  3(b)).  This  eliminates  the  delay  of  the  segment  index 
encoder,  and  produces  fast  NFGs.  To  produce  uniform  seg¬ 
mentation,  we  begin  by  finding  the  smallest  square  segment 
needed  to  achieve  the  acceptable  approximation  error  using 
the  recursive  segmentation  algorithm  shown  in  Fig.  2.  Then, 
we  partition  a  given  domain  into  square  segments  with  the 
same  size  as  the  smallest  segment. 

4.  Architectures  for  Two-Variable  Numerical 
Function  Generators 

4.1.  Architectures  Based  on  Recursive  and 
Uniform  Segmentations 

For  each  segment  {[Bx,Ef),  [By,Ey)}  produced  by  a  pla¬ 
nar  segmentation  algorithm,  we  compute  the  approximation 
to  f(X,Y)  as  a  polynomial  P(X .  Y )  that  is  a  Taylor  expan¬ 
sion  with  a  correction  value.  Expanding  and  rearranging  the 
polynomial  yields 

P(X,Y)  =  C0  +  Cx(X  —  Bx)  +Cy(Y  —  Bf)  (1) 

+Qv(X  -  Bf)  (Y  —  By)  +  Cx2  {X  -  Bf)2 
+  Cy2(Y  -  Bf2  +  .  .  .  +  CyfY  -  By)d. 

Fig.  3  shows  two  architectures  for  two-variable  NFGs  re¬ 
alizing  (1)  which  use  piecewise  planar  approximation  (only 


(a)  Segment  index  function.  1 


(b)  LUT  cascade  and 
adders  [13]. 

Figure  4.  Segment  index  encoder. 


the  first  three  terms  of  (1).  Expanding  these  architectures  to 
a  polynomial  approximation  with  higher  degree  is  straight¬ 
forward.  Fig.  3(a)  and  (b)  show  architectures  based  on  re¬ 
cursive  segmentation  and  uniform  segmentations,  respec¬ 
tively.  The  segment  index  encoder  converts  values  of  X  and 
Y  into  a  segment  number.  This,  in  turn,  is  applied  as  the 
address  input  of  the  Coefficients  Memory.  The  coefficients 
are  applied  to  adders  and  multipliers  to  form  the  polyno¬ 
mial  value  P(X .  Y).  Note  that  Fig.  3(a)  uses  bitwise  ANDs 
to  compute  X  —  Bx  and  Y  —  By.  In  recursive  segmentation, 
we  can  realize  X  —  Bx  and  Y  —  By  using  AND  gates  driven 
on  one  side  by  Bx  and  By,  respectively  [13]. 

Note  that  Fig.  3(b)  has  neither  a  segment  index  encoder 
nor  bitwise  ANDs.  In  uniform  segmentation,  the  segment 
index  encoder  and  bitwise  ANDs  are  not  necessary  because 
a  segment  number,  X  —  Bx,  and  Y  —  By  are  obtained  by 
the  most  significant  bits  and  the  least  significant  bits  of  X 
and  T,  respectively.  Since  modern  FPGAs  have  logic  el¬ 
ements,  synchronous  memory  blocks,  and  dedicated  mul¬ 
tipliers,  these  architectures  are  efficiently  implemented  by 
those  hardware  resources  in  an  FPGA. 

4.2.  Architecture  and  Design  Method  for 
Segment  Index  Encoder 

The  segment  index  encoder  realizes  the  segment  index 

function:  {0,1}”  x  {0,1}"  — >  {0,1 _ ,k—  1}  shown  in 

Fig.  4(a),  where  X  and  Y  have  n  bits,  and  k  denotes  the 
number  of  segments.  We  realize  this  function  with  the  ar¬ 
chitecture  shown  in  Fig.  4(b).  In  this  architecture,  the  in¬ 
terconnecting  lines  between  adjacent  LUT  memories  deter¬ 
mine  the  position  in  the  EVBDD  (labeled  rails),  and  the 
outputs  from  each  LUT  memory  to  the  adders  tally  the  func¬ 
tion  value  (labeled  Arails).  Consider  the  design  of  the  LUT 
cascade  and  adders  in  Fig.  4(b),  given  the  segmentation  pro¬ 
duced  in  Fig.  2. 

We  begin  by  representing  the  segment  index  function 


(a)  Two  segments. 


(b)  Four  segments. 


Figure  5.  Relationship  between  recursive 
segmentation  and  MTBDDs. 


using  an  MTBDD.  Fig.  5  illustrates  the  relationship  be¬ 
tween  recursive  segmentation  and  MTBDDs.  Then,  we 
convert  the  MTBDD  into  an  EVBDD.  By  decomposing  the 
EVBDD,  as  shown  in  Fig.  6,  we  obtain  the  architecture  in 
Fig.  4(b).  In  Fig.  6,  the  column  labeled  as  ‘r,-’  in  the  table 
of  each  LUT  memory  denotes  the  rails  that  represent  sub¬ 
functions  in  the  EVBDD.  And,  the  column  ‘a,-’  in  Fig.  6  de¬ 
notes  the  Arails  that  represent  the  sum  of  weights  of  edges. 
In  the  EVBDD,  assigned  to  edges  that  cut  across 

the  horizontal  lines  represents  the  sum  of  weights  and  sub¬ 
functions,  respectively.  For  more  detail  on  this  architecture, 
see  [13]. 

In  this  architecture,  the  size  of  LUT  memories  realizing 
the  recursive  segmentation  depends  on  the  number  of  seg¬ 
ments.  Specifically, 


Theorem  1  Let  seg-func{X,Y)  be  a  segment  index  func¬ 
tion  obtained  by  a  recursive  planar  segmentation.  The  seg¬ 
ment  index  function  can  be  realized  by  the  segment  index 
encoder  shown  in  Fig.  4(b)  with  at  most  [ log2  k]  rails  and 
[  log2  k]  Arails,  where  k  is  the  number  of  segments. 


Proof:  See  Appendix. 

In  our  architectures,  the  coefficients  memory  and  the 
LUT  memories  of  the  segment  index  encoder  are  imple¬ 
mented  by  embedded  RAMs  (e.g.  M4Ks  in  Altera  FPGAs). 
Thus,  by  changing  the  data  for  the  coefficients  memory  and 
the  LUT  memories,  a  wide  class  of  two-variable  functions 
can  be  realized  by  a  single  architecture.  Since  just  changing 
the  RAM  data  can  switch  functions,  we  can  switch  func¬ 
tions  without  reprogramming  the  FPGA. 


Figure  6.  Decomposition  of  the  EVBDD. 


5.  Experimental  Results 

5.1.  Number  of  Segments  and  Computation 

Time  for  Algorithms 

Table  1  shows  the  number  of  segments  produced  by  the 
two  segmentation  algorithms  presented  in  Section  3,  and 
their  computation  time  for  various  functions  [1  ].  These  seg¬ 
ments  are  required  to  approximate  two-variable  functions 
by  planar  ( 1  st-order)  polynomials.  In  this  table,  WaveRings, 
Gaussian,  and  Beta  are  defined  as: 

cos  (fX2  +F2^  j  xi 

WaveRines  =  v  /  Gaussian  =  — -=e  ir1 

s/X2  +  Y2  +0.25  YV 2it 

Beta  =  2  f*  sin2*"1  9  cos27"1  9  dQ  - 

Table  1  shows  that,  for  all  functions  except  sin(jtAT),  the 
recursive  segmentation  algorithm  produces  many  fewer  seg¬ 
ments  than  the  uniform  segmentation  algorithm.  Especially, 
for  higher  accuracy,  the  number  of  segments  needed  in  re¬ 
cursive  segmentation  is  only  a  few  percent  of  the  number  of 
segments  needed  in  uniform  segmentation.  For  sin(  ttA'K), 
the  additional  segments  needed  in  uniform  segmentation  are 
not  so  large  even  for  higher  accuracy.  This  means  that,  for 
this  function,  the  uniform  segmentation  method  also  pro¬ 
duces  an  NFG  with  reasonable  size.  In  addition.  Table  1 
shows  that  both  algorithms  produce  segments  with  small 
CPU  time.  Such  quick  segmentation  is  useful  to  reduce  de¬ 
sign  time  for  NFGs. 

5.2.  Memory  Sizes  Needed  for  Numerical 
Function  Generators 

Table  2  compares  total  memory  sizes  needed  for  NFGs 
based  on  the  two  planar  approximation  architectures  shown 
in  Fig.  3.  Note  that  the  NFGs  based  on  recursive  segmenta¬ 
tion  have  two  kinds  of  memories:  coefficients  memory  and 


Table  1.  Number  of  segments  for  two  segmentation  methods  based  on  planar  approximation. 


No. 

Function 

Domain 

Z  and  F  have  8-bit 

accuracy 

X  and  Y  have  12-bit  accuracy 

f(XJ) 

(Acceptable  approx. 

error:  2 

10) 

(Acceptable  approx,  error:  2“ 14 

) 

No.  of  segments 

Rs 

Time  [sec.] 

No.  of  segments 

Rs 

Time  [sec.] 

Recur. 

Uni. 

[%] 

Recur. 

Uni. 

Recur. 

Uni. 

m 

Recur. 

Uni. 

fo 

sin(7tZ)ln(F) 

0<Z<L0<F<1 

4,696 

65,280 

7 

0.19 

0.02 

244,807 

16,773,120 

i 

8.9 

7.2 

h 

sin(jtZ)  y/Y 

0<Z<  1. 0  <  F  <  1 

1,393 

16,384 

9 

0.07 

0.61 

38,773 

16,773,120 

0.2 

1.7 

6.7 

fo 

sin(7tZF) 

0<Z<L0<F<1 

1,486 

4,096 

36 

0.07 

0.19 

26,122 

65,536 

40 

1.2 

3.2 

A 

Z4F5 

0<Z<L0<F<1 

457 

4,096 

11 

0.03 

0.20 

8,179 

262,144 

3 

0.5 

11.1 

fo 

\/s/X2  +  Y2 

0<Z<  1. 0  <  F  <  1 

3,835 

65,025 

6 

0.11 

0.01 

173,552 

16,769,025 

1 

5.0 

5.1 

fo 

ZF/\/Z2+F2 

0<Z<  1. 0  <  F  <  1 

376 

4,096 

9 

0.01 

0.11 

6,523 

1,048,576 

0.6 

0.2 

22.5 

h 

WaveRings 

0<Z<7t,  0<F<n 

1,619 

10,201 

16 

0.08 

0.39 

28,377 

646,416 

4 

1.3 

18.9 

A 

Gaussian 

0<Z<  1,0<F<  1 

3,182 

65,025 

5 

0.12 

0.02 

141,113 

16,769,025 

0.8 

5.5 

7.1 

A 

Vx2  +  y2 

0<Z<  1. 0  <  F  <  1 

355 

4,096 

9 

0.01 

0.12 

6,160 

1,048,576 

0.6 

0.2 

24.6 

A 

v/XT+FJ 

0<Z<  1. 0  <  F  <  1 

400 

16,384 

2 

0.04 

0.76 

6,790 

4,194,304 

0.2 

0.5 

188.8 

ho 

Beta 

1/8  <Z  <  1.  1/8  <F  <  1 

5,815 

50,176 

12 

0.73 

0.05 

187,201 

3,211,264 

6 

24.6 

326.5 

Recur.:  Recursive  segmentation.  Uni.:  Uniform  segmentation.  Rs:  Recur.  /  Uni.  x  100  (%). 

Time:  CPU  time  needed  for  segmentation  algorithm. 

Experiment  environment 

CPU:  Intel  Xeon  2.6GHz  Memory  :  1GB  OS:  Redhat  Linux  C  compiler :  gcc  -02 


Table  2.  Total  memory  sizes  needed  for  NFGs 
based  on  two  planar  approximation  architec¬ 
tures, _ 


No. 

8-bit  accuracy  NFGs 

12-bit  accuracy  NFGs 

Recursive 

Uniform 

Rm 

Recursive 

Uniform 

Rm 

fo 

260,052 

783,360 

33 

16,683,510 

285,143,040 

6 

fl 

59,511 

360,448 

17 

2,030,356 

201,277,440 

1 

fo 

69,352 

1 10,592 

63 

1,313,684 

2,293,760 

57 

fo 

25,392 

102,400 

25 

516,230 

8,126,464 

6 

h 

226,403 

1,040,400 

22 

13,054,030 

402,456,600 

3 

fo 

18,120 

90,112 

20 

369,189 

27,262,976 

1 

h 

100,030 

346,834 

29 

1,886,924 

23,917,392 

8 

fo 

186,980 

910,350 

21 

11,345,482 

368,918,550 

3 

h 

16,882 

94,208 

18 

316.128 

28,311,552 

1 

h 

21,792 

278,528 

8 

405,576 

88,080,384 

0.5 

ho 

291,735 

602,112 

48 

11,814,069 

122,028,032 

10 

Rm:  Recursive  /  Uniform  x  100  (%). 


LUT  memory,  and  thus  their  memory  size  is  the  sum  of  the 
coefficients  memory  size  and  the  LUT  memory  sizes. 

Table  2  shows  that,  for  all  functions,  NFGs  based  on 
recursive  segmentation  require  smaller  memory  size  than 
NFGs  based  on  uniform  segmentation,  even  though  NFGs 
based  on  recursive  segmentation  have  a  segment  index  en¬ 
coder.  For  example,  for  XT / y/X 2  +  Y2,  the  12-bit  accuracy 
NFG  using  recursive  segmentation  requires  only  0.6%  of 
memory  required  by  uniform  segmentation. 

To  understand  the  relation  between  memory  size  and  ac¬ 
curacy,  we  designed  NFGs  for  XY /\/X2  +  Y 2  with  various 
accuracies.  Fig.  7  plots  memory  sizes  of  the  NFGs  for  4  to 
16-bit  accuracies.  There  are  three  curves: 

1 .  a  single  look-up  table  in  which  the  values  assigned  to  X 
and  T  form  an  address  and  the  contents  of  that  address 
is  f(X,Y), 


Figure  7.  Memory  size  versus  accuracy  for 

XY/VX2  +  Y2. 


2.  NFG  with  recursive  non-uniform  segmentation,  and 

3.  NFG  with  uniform  segmentation. 

Interestingly,  for  this  function,  the  memory  size  of  the 
NFGs  based  on  uniform  segmentation  increases  in  the  same 
way  as  memory  size  of  a  single  look-up  table.  On  the  other 
hand,  the  memory  size  of  the  NFGs  based  on  recursive  seg¬ 
mentation  increases  much  more  slowly  than  the  other  two. 
For  16-bit  accuracy,  the  memory  size  of  the  NFG  based  on 
recursive  segmentation  is  only  0.09%  of  that  of  the  NFG 
based  on  uniform  segmentation. 

5.3.  FPGA  Implementation  Results 

We  implemented  8-bit  accuracy  NFGs  based  on 
the  two  architectures  using  the  Altera  Stratix  FPGA 


Table  3.  FPGA  implementation  of  8-bit  accuracy  NFGs  based  on  two  architectures. 


FPGA  device: 

Logic  synthesis  tool: 

Altera  Stratix  (EP1S10F484C7) 
Synplify  Pro  Ver.  8.8 

Function 

f(X,Y) 

Recursive  segmentation 

Uniform  segmentation 

#LEs 

#DSPs 

Freq. 

[MHz] 

#stages 

Delay 

[nsec.] 

#LEs 

#DSPs 

Freq. 

[MHz] 

#stages 

Delay 

[nsec.] 

sin(7tX)ln(F) 

347 

4 

149 

9 

60 

- 

0 

- 

i 

- 

sin(7iX)\/F 

206 

2 

149 

7 

47 

58 

1 

183 

3 

16 

sin(7iXT) 

280 

2 

169 

7 

41 

62 

2 

183 

3 

16 

X4F5 

280 

2 

169 

7 

41 

57 

2 

183 

3 

16 

l/v/X2+F2 

552 

4 

169 

9 

53 

- 

0 

- 

1 

- 

xy/Vx2  +  y2 

180 

2 

169 

6 

35 

56 

2 

183 

3 

16 

WaveRings 

364 

4 

149 

8 

54 

78 

2 

183 

3 

16 

Gaussian 

430 

4 

149 

10 

67 

- 

0 

- 

1 

- 

Vx2  +  y2 

177 

2 

169 

6 

35 

60 

2 

183 

3 

16 

v/XT+FJ 

189 

2 

169 

6 

35 

56 

0 

343 

3 

9 

Beta 

439 

4 

169 

9 

53 

- 

0 

- 

1 

- 

NFGs  cannot  be  mapped  into  the  FPGA  due  to  insufficient  memory  blocks. 

#LEs:  Number  of  logic  elements.  #DSPs:  Number  of  9-bit  x  9-bit  DSP  units. 

Freq.  :  Operating  frequency.  #stages  :  Number  of  pipeline  stages. 


(EP1S10F484C7).  Table  3  compares  the  FPGA  implemen¬ 
tation  results  of  the  two  architectures.  In  this  table,  the 
columns  “Delay”  show  the  total  delay  time  of  each  NFG 
from  the  input  to  the  output,  in  nanoseconds. 

The  NFGs  based  on  uniform  segmentation  require  fewer 
pipeline  stages  and  have  shorter  delay  than  the  recursive 
segmentation  because  they  have  no  segment  index  encoder. 
However,  for  four  functions,  the  NFGs  based  on  uniform 
segmentation  are  not  so  easily  implemented  in  an  FPGA 
due  to  excessive  memory  size.  Table  3  shows  that  they  can¬ 
not  be  mapped  into  the  FPGA  due  to  insufficient  memory 
blocks.  Note  that  NFGs  that  have  only  one  pipeline  stage  in 
Table  3  are  realized  with  a  single  look-up  table  due  to  the 
excessively  many  segments.  On  the  other  hand,  for  all  func¬ 
tions,  the  NFGs  based  on  recursive  segmentation  achieve 
high  operating  frequency. 

It  is  important  to  note  that  certain  two-variable  functions 
can  be  designed  using  1.  one-variable  NFGs  and  2.  basic 
operations  like  addition  and  multiplication.  For  example, 
the  first  function  in  Table  1,  sin (71 A )  ln(T),  can  be  designed 
using  two  one- variable  NFGs,  one  realizing  sinfttA )  and 
the  other  realizing  ln(T).  The  outputs  are  then  multiplied 
together  to  realize  the  two-variable  function.  We  are  then 
interested  in  the  complexity  of  this  realization  compared  to 
the  direct  two-variable  NFG  design  discussed  earlier. 

To  understand  the  relative  merits  of  using  one  versus 
two-variable  NFGs,  we  implemented  the  following  three 
functions  from  Table  1 , 

1.  sin(jtX) ln(F), 

2.  XY/s/X2  +  Y2,  and 

3.  WaveRings 

using  one-variable  NFGs  and  basic  operations.  Each  one- 
variable  NFG  was  realized  by  the  method  shown  in  [13], 


Table  4.  FPGA  implementation  of  8-bit  accu¬ 
racy  NFGs  designed  using  combination  of 
one-variable  NFGs. _ 


FPGA  device: 

Logic  synthesis  tool: 

Altera  Stratix  (EP1S10F484C7) 
Synplify  Pro  Ver.  8.8 

Function 

f(X,Y) 

Memory 

[bits] 

#LEs 

#DSPs 

Freq. 

[MHz] 

#stages 

Delay 

[nsec.] 

sin(jtX)ln(F) 

7,104 

234 

4 

149 

7 

47 

XY/y/X2  +  Y 2 

31,104 

381 

13 

133 

12 

90 

WaveRings 

15,232 

410 

10 

149 

11 

74 

Memory  :  Total  memory  size  needed  for  two- variable  functions. 


which  is  based  on  linear  approximation  and  non-uniform 
segment  lengths.  Table  4  shows  the  results. 

Except  for  sin(7tX) ln(F),  the  direct  two-variable  NFG 
implementation  requires  fewer  logic  elements  (FEs)  and 
DSPs  than  the  one-variable  implementation.  Also,  except 
for  sin(7tX)  ln(T),  the  direct  two-variable  implementations 
have  shorter  delay.  For  XY /  \/A2  +  Y2  and  WaveRings,  the 
delays  of  the  two-variable  implementations  are  only  39% 
and  73%  of  those  of  the  one-variable  implementations,  re¬ 
spectively.  Especially,  in  the  case  of  XY /  \/X 2  +  Y2,  both 
complexity  and  delay  of  the  two-variable  NFG  are  signifi¬ 
cantly  less  than  the  one-variable  NFG  implementation.  For 
example,  the  X  in  the  denominator,  must  be  squared,  added 
to  F2,  the  reciprocal  square  root  taken,  and  then  multiplied 
by  XY .  This  incurs  a  significant  complexity  and  speed 
penalty. 

From  these  results,  we  can  see  that  by  designing  two- 
variable  functions  using  one-variable  NFGs,  the  required 
memory  size  can  be  reduced  significantly.  However,  de¬ 
pending  on  functions,  it  can  produce  a  slow  implementation 
because  of  additional  logic  such  as  multipliers.  Also,  com- 


plicated  hardware  architecture  using  one-variable  NFGs 
makes  error  analysis  harder,  and  it  is  harder  to  guarantee 
output  accuracy.  This  increases  design  time. 

6.  Concluding  Remarks 

We  have  proposed  a  design  method  and  programmable 
architectures  for  numerical  function  generators  of  two- 
variable  functions.  To  realize  a  two- variable  function  in 
hardware,  we  partition  the  given  domain  of  the  function  into 
segments,  and  approximate  the  given  function  by  a  polyno¬ 
mial  in  each  segment.  In  this  paper,  we  presented  two  planar 
segmentation  algorithms  which  partition  a  given  domain  of 
two-variable  function  efficiently.  To  the  best  of  our  knowl¬ 
edge,  this  is  the  first  systematic  design  method  based  on 
piecewise  polynomial  approximation  for  two-variable  func¬ 
tions.  Experimental  results  show  that  for  a  complicated 
function,  our  automatically  generated  NFG  achieves  higher 
performance  than  manually  designed  NFG. 

The  algorithms  and  architectures  presented  in  this  paper 
can  be  easily  extended  to  functions  with  three  or  more  vari¬ 
ables. 
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Appendix 

The  proof  of  Theorem  1  is  based  on  a  theorem  proven 
in  [13].  Specifically,  it  was  shown  that 

Theorem  A  [13]  Let  g(Z)  be  a  k-vcilued  monotone  in¬ 
creasing  function.  The  function  g(Z)  can  be  realized  by 
the  segment  index  encoder  shown  in  Fig.  4(b)  with  at  most 
[ log2  k\  rails  and  [ log2 k]  Arails. 

Theorem  1  Let  seg-func(X,  Y )  be  a  segment  index  func¬ 
tion  obtained  by  a  recursive  planar  segmentation.  The  seg¬ 
ment  index  function  can  be  realized  by  the  segment  index 
encoder  shown  in  Fig.  4(b)  with  at  most  [ log2  k]  rails  and 
[log2k]  Arails,  where  k  is  the  number  of  segments. 

Proof:  By  forming  a  variable 

Z  =  (xi- 1  yi- 1  XI- 2  yi-2  ■  ■  ■  X-m  y-m) 

from  X  and  Y,  seg-func(X,Y)  obtained  by  the  recursive 
planar  segmentation  algorithm  can  be  converted  into  a  k- 
valued  monotone  increasing  function  g(Z).  Therefore,  from 
Theorem  A,  we  have  this  theorem.  I 


